Abstract

In this paper we address the problem of predicting a time series using the ARMA (autoregressive
moving average) model, under minimal assumptions on the noise terms. Using regret minimization
techniques, we develop effective online learning algorithms for the prediction problem, without assuming
that the noise terms are Gaussian, identically distributed or even independent. Furthermore, we
show that the performance of our algorithms asymptotically approaches the performance of the best ARMA
model in hindsight.



1 Introduction 



A time series is a sequence of real-valued signals that are measured at successive time intervals.
Autoregressive (AR), moving average (MA), and autoregressive moving average (ARMA) models are often used
for the purpose of time-series modeling, analysis and prediction. These models have been successfully 
used in a wide range of applications such as speech analysis, noise cancelation, and stock market analysis 
([Ham94, BJR94, SS05, BD09]). Roughly speaking, they are based on the assumption that each new signal 
is a noisy linear combination of the last few signals and independent noise terms. 

A great deal of work has been done on parameter identification and signal prediction using these models, 
mainly in the "proper learning" setting, in which the fitted model tries to mimic the assumed underlying 
model. Most of this work relied on strong assumptions regarding the noise terms, such as independence and 
identical Gaussian distribution. These assumptions are quite strict in general, and the following statement
from [Tho94] is sometimes quoted:



Experience with real-world data, however, soon convinces one that both stationarity and
Gaussianity are fairy tales invented for the amusement of undergraduates.

In this paper we argue that these assumptions can be relaxed into less strict assumptions on the noise terms. 
Moreover, we offer a novel approach for time series analysis and prediction — an online learning approach 
that allows the noise to be arbitrarily or even (to some extent) adversarially generated. The goal of this 
paper is to show that the new approach is more general, and is capable of coping with a wider range of time 
series and loss functions (rather than only the squared loss).

1.1 Summary of results



We present and analyze two online algorithms for the prediction problem, one designed for general convex
loss functions and the other for exp-concave ones. Each of these algorithms attains a sublinear regret bound
against the best ARMA prediction in hindsight, under weak assumptions on the noise terms. We apply
our results to the most commonly used loss function in time series analysis, the squared loss, and achieve
a regret bound of $O(\log^2(T))$ against the best ARMA prediction in hindsight. Finally, we present an
empirical study that verifies our theoretical results.



1.2 Related work 

In standard time series analysis, the squared loss is usually considered and the noise terms are assumed to be
independent with bounded variance and zero mean. In this specific setting, one can assume without loss of
generality that the noise terms have identical Gaussian distribution (see [Ham94, BJR94, BD09] for more
information). This allows the use of statistical methods, such as least squares and maximum likelihood
based methods, for the tasks of analysis and prediction. However, when different loss functions are considered,
these assumptions do not hold in general, and the aforementioned methods are not applicable. We are
not aware of a previous approach that tries to relax these assumptions for general convex loss functions. We
note that there has been previous work which tries to relax such assumptions for the squared loss, usually
under additional modelling assumptions such as a t-distribution of the noise (e.g., [DES89, TWVB00]). We
emphasize that the independence assumption is rather strict, and previous works that relax this assumption
usually offer a specific dependency model, e.g., as proposed by [Eng82] for the ARCH model.

Furthermore, an online approach that relies on regret minimization techniques was never considered for
ARMA prediction, and hence regret bounds of the type we are interested in simply do not exist. Yet, results
on the convergence rate of the coefficient vectors do exist, and regret bounds can be derived from these
results. E.g., in [DSC06] such results are presented, and a regret bound of $O(\log^2(T))$ can be derived for
the squared loss. We are not aware of results of this kind for general convex loss functions.



2 Preliminaries and model 
2.1 Time series modelling 

A time series is a sequence of signals, measured at successive times, which are assumed to be spaced at
uniform intervals. We denote by $X_t$ the signal measured at time $t$, and by $\epsilon_t$ the noise term at time $t$. The
AR($k$) (short for autoregressive) model, parameterized by a horizon $k$ and a coefficient vector $\alpha \in \mathbb{R}^k$,
assumes that the time series is generated according to the following model, where $\epsilon_t$ is a zero-mean random
noise term:

\[ X_t = \sum_{i=1}^{k} \alpha_i X_{t-i} + \epsilon_t. \tag{1} \]

In words, the model assumes that each $X_t$ is a noisy linear combination of the previous $k$ signals. A
more sophisticated model is the ARMA($k, q$) (short for autoregressive moving average) model, which is
parameterized by two horizon terms $k, q$ and coefficient vectors $\alpha \in \mathbb{R}^k$ and $\beta \in \mathbb{R}^q$. This model assumes
that $X_t$ is generated via the formula:

\[ X_t = \sum_{i=1}^{k} \alpha_i X_{t-i} + \sum_{i=1}^{q} \beta_i \epsilon_{t-i} + \epsilon_t, \tag{2} \]

where again $\epsilon_t$ are zero-mean noise terms. Sometimes, an additional constant bias term is added to the
equation (to indicate constant drift), but we will ignore this for simplicity. Note that the AR($k$) model is a
special case of the ARMA($k, q$) model, where the $\beta_i$ coefficients are all zero.
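As an illustration of Equation (2), the following is a minimal sketch of a generator for an ARMA($k, q$) signal from a given noise sequence. The function name `simulate_arma` and the convention that signals and noise terms before the first time step are zero are assumptions made for this example, not part of the model definition.

```python
import numpy as np

def simulate_arma(alpha, beta, noise):
    """Generate X_t = sum_i alpha_i X_{t-i} + sum_i beta_i eps_{t-i} + eps_t.

    Assumption for this sketch: signals and noise terms with a negative
    time index are treated as zero.
    """
    alpha = np.asarray(alpha, dtype=float)
    beta = np.asarray(beta, dtype=float)
    noise = np.asarray(noise, dtype=float)
    k, q = len(alpha), len(beta)
    x = np.zeros(len(noise))
    for t in range(len(noise)):
        # Autoregressive part: weighted sum of the previous k signals.
        ar = sum(alpha[i] * x[t - 1 - i] for i in range(k) if t - 1 - i >= 0)
        # Moving average part: weighted sum of the previous q noise terms.
        ma = sum(beta[i] * noise[t - 1 - i] for i in range(q) if t - 1 - i >= 0)
        x[t] = ar + ma + noise[t]
    return x
```

Setting all entries of `beta` to zero recovers the AR($k$) model of Equation (1).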

2.2 The online setting for ARMA prediction 

Online learning is usually defined in a game-theoretic framework, where the data, rather than being chosen
stochastically, are chosen arbitrarily, possibly by an all-powerful adversary with full knowledge of our
learning algorithm (see for instance [CBL06]). In our context, we will describe the setting as follows: First,
some coefficient vectors $(\alpha, \beta)$ are fixed by the adversary. At each time point $t$, the adversary chooses $\epsilon_t$
and generates the resulting signal $X_t$ using the formula in Equation (2). We emphasize that $(\alpha, \beta)$ and the
noise terms are not revealed to us at any time point.

At iteration $t$, we need to make a prediction $\tilde{X}_t$ for the signal, after which the real signal $X_t$ is revealed,
and we suffer a loss denoted by $\ell_t(X_t, \tilde{X}_t)$. Our goal is to minimize the sum of losses over a predefined
number of iterations $T$. A reasonable benchmark is to try to be not much worse than the best possible
ARMA model. More precisely, we let

\[ f_t(\alpha, \beta) = \ell_t\big(X_t, \tilde{X}_t(\alpha, \beta)\big) = \ell_t\Big(X_t, \sum_{i=1}^{k} \alpha_i X_{t-i} + \sum_{i=1}^{q} \beta_i \epsilon_{t-i}\Big) \tag{3} \]

denote the loss at time $t$ of the (conditionally expected) prediction given by an ARMA model with some
coefficients $(\alpha, \beta)$. We then define the regret as

\[ R_T = \sum_{t=1}^{T} \ell_t(X_t, \tilde{X}_t) - \min_{\alpha, \beta} \sum_{t=1}^{T} \ell_t\big(X_t, \tilde{X}_t(\alpha, \beta)\big). \tag{4} \]

We wish to obtain efficient algorithms, whose regret grows sublinearly in $T$, corresponding to an average
per-round regret going to zero as $T$ increases.¹

A major challenge in our setting is that the noise terms $\{\epsilon_t\}_{t=1}^{T}$ are unknown. As a result, we cannot use
existing online convex optimization algorithms over the space of coefficient vectors $(\alpha, \beta)$. Moreover, even
if we are given some $(\alpha, \beta)$, we cannot generate a prediction $\tilde{X}_t$ using the ARMA model. This lack of
information also makes it hard to compute the best coefficient vectors in hindsight, and hence competing
against the best ARMA model is ill-defined in this case.

¹ The iterations in which $t < k$ are usually ignored; since we assume that the loss per iteration is bounded by a constant, this
adds at most a constant to the final regret bound.

2.3 Our assumptions

Throughout Section 3 we assume the following:

1. The noise terms are stochastically and independently generated, each from a zero-mean distribution
which might be chosen adversarially (up to the assumptions below). In Section 4 we show how
to relax this assumption to adversarial noise. Also, we assume that $\mathbb{E}[|\epsilon_t|] < M_{\max} < \infty$ and
$\mathbb{E}[\ell_t(X_t, X_t - \epsilon_t)] < \infty$ for all $t$.

2. The loss function $\ell_t$ is Lipschitz continuous for some Lipschitz constant $L > 0$. This is a standard
assumption and it holds in particular for the squared loss, as well as for other convex loss functions,
with compact domain.

3. The coefficients $\alpha_i$ satisfy $|\alpha_i| < c$ for some $c \in \mathbb{R}$. This assumption is also standard, and needed in
general for the decision set (defined in Subsection 3.1) to be bounded. We assume that $c = 1$ without
loss of generality.

4. The coefficients $\beta_i$ satisfy $\sum_{i=1}^{q} |\beta_i| < 1 - \varepsilon$, for some $\varepsilon > 0$.

5. The signal is bounded (by a constant which is independent of $T$). Without loss of generality we assume
that $|X_t| < 1$ for all $t$.

3 Online time series prediction 

As said before, we cannot use existing online convex optimization algorithms over the space of coefficient
vectors $(\alpha, \beta)$ since the noise terms are unknown to us at any stage. Instead, we use an improper learning
approach, where our predictions at each time point will not come from an ARMA model that tries to mimic
the underlying model. More specifically, we fix some $m \in \mathbb{N}$, and at each time point $t$, we choose an
$(m + k)$-dimensional coefficient vector $\gamma^t \in \mathbb{R}^{m+k}$ and predict by $\tilde{X}_t(\gamma^t) = \sum_{i=1}^{m+k} \gamma_i^t X_{t-i}$. It follows that
our loss at iteration $t$ is determined by the loss function

\[ \ell_t^m(\gamma^t) = \ell_t\big(X_t, \tilde{X}_t(\gamma^t)\big) = \ell_t\Big(X_t, \sum_{i=1}^{m+k} \gamma_i^t X_{t-i}\Big). \]

This can be seen as an AR model with horizon $(m + k)$. This leads to one of our key results: we can learn
an ARMA($k, q$) model using an AR($m + k$) model, for a properly chosen value of $m$. We quantify this result in
Theorem 3.1 in terms of regret.
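To see why a purely autoregressive predictor can stand in for an ARMA predictor, it may help to unroll the simplest case by hand. The following worked example is a sketch for ARMA(1,1) only, under the assumption that $(\alpha, \beta)$ are the coefficients that generated the signal, so that $\epsilon_{t-1} = X_{t-1} - \tilde{X}_{t-1}$ with $\tilde{X}_t = \alpha X_{t-1} + \beta \epsilon_{t-1}$ the conditionally expected prediction.

```latex
\begin{align*}
\tilde{X}_t &= \alpha X_{t-1} + \beta \epsilon_{t-1}
             = \alpha X_{t-1} + \beta \big(X_{t-1} - \tilde{X}_{t-1}\big)
             = (\alpha + \beta) X_{t-1} - \beta \tilde{X}_{t-1} \\
            &= (\alpha + \beta) X_{t-1} - \beta (\alpha + \beta) X_{t-2} + \beta^2 \tilde{X}_{t-2} \\
            &= \sum_{j=1}^{n} (\alpha + \beta)(-\beta)^{j-1} X_{t-j} \;+\; (-\beta)^{n} \tilde{X}_{t-n}.
\end{align*}
```

Under Assumption 4 with $q = 1$ we have $|\beta| < 1 - \varepsilon$, so the remainder term shrinks geometrically in $n$ and the weights on older signals decay at the same rate. This is the intuition behind predicting with only the last $m + k$ signals.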

3.1 Algorithm parameters definition and calculation 

Before presenting the algorithm and stating our main theorem, we need to define the following parameters.
The decision set $\mathcal{K}$ is the set of candidates ($(m + k)$-dimensional coefficient vectors) we can choose from
at each iteration; it is defined as




(5) 
Intuitively, the structure of $\mathcal{K}$ follows from Assumptions 3-4 on $\alpha$ and $\beta$, which restrict our improper
learning variable $\gamma$. We denote by $D$ the diameter of $\mathcal{K}$, and bound:

\[ D = \sup_{\gamma_1, \gamma_2 \in \mathcal{K}} \|\gamma_1 - \gamma_2\|_2 = \sqrt{2(m + k)}. \tag{6} \]

Next, we denote by $G$ the upper bound of $\|\nabla \ell_t^m(\gamma)\|$ for all $t$ and $\gamma \in \mathcal{K}$. This parameter depends on the
loss function considered, and its computation is done accordingly. E.g., for the squared loss we get that
$G = 2\sqrt{m + k}\,D$, relying on Assumption 5. Finally, we denote by $\lambda$ the exp-concavity parameter of the
loss functions $\{\ell_t^m\}_{t=1}^{T}$, i.e., it holds that $e^{-\lambda \ell_t^m(\gamma)}$ is concave for all $t$.² This parameter is relevant only
for exp-concave loss functions, and its computation is also done according to the loss function considered.
It can be shown that $\lambda =$ when the squared loss is considered.



3.2 ARMA Online Newton Step (ARMA-ONS) 

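Algorithm 1 shows how to choose $\gamma^t$ in each iteration, when the loss functions $\{\ell_t^m\}_{t=1}^{T}$ are assumed to
be $\lambda$-exp-concave in $\gamma$. The notation $\Pi_{\mathcal{K}}^{A_t}$ refers to the projection onto $\mathcal{K}$ in the norm induced by $A_t$, i.e.,

\[ \Pi_{\mathcal{K}}^{A_t}(y) = \arg\min_{x \in \mathcal{K}} (y - x)^{\top} A_t (y - x). \]

Algorithm 1 ARMA-ONS(k,q)
1: Input: ARMA order $k, q$; learning rate $\eta$; an initial $(m + k) \times (m + k)$ matrix $A_0$.
2: Set $m = q \cdot \log_{1-\varepsilon}\big((T L M_{\max})^{-1}\big)$.
3: Choose $\gamma^1 \in \mathcal{K}$ arbitrarily.
4: for $t = 1$ to $(T - 1)$ do
5:     Predict $\tilde{X}_t(\gamma^t) = \sum_{i=1}^{m+k} \gamma_i^t X_{t-i}$.
6:     Observe $X_t$ and suffer loss $\ell_t^m(\gamma^t)$.
7:     Let $\nabla_t = \nabla \ell_t^m(\gamma^t)$, update $A_t \leftarrow A_{t-1} + \nabla_t \nabla_t^{\top}$.
8:     Set $\gamma^{t+1} \leftarrow \Pi_{\mathcal{K}}^{A_t}\big(\gamma^t - \frac{1}{\eta} A_t^{-1} \nabla_t\big)$.
9: end for

In case the dimension $(m + k)$ of $A_t$ is large, we note that its inverse can be efficiently re-computed after
each update using the Sherman-Morrison formula.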
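The rank-one inverse update mentioned above can be sketched as follows. This is an illustrative helper, not code from the paper: the name `sherman_morrison_update` is hypothetical, and the sketch assumes that $A_{t-1}$ is symmetric and invertible, which holds when $A_0$ is a positive multiple of the identity and every update adds an outer product $\nabla_t \nabla_t^{\top}$.

```python
import numpy as np

def sherman_morrison_update(a_inv, g):
    """Return (A + g g^T)^{-1} given a_inv = A^{-1}, for symmetric invertible A.

    Uses (A + g g^T)^{-1} = A^{-1} - (A^{-1} g)(A^{-1} g)^T / (1 + g^T A^{-1} g),
    which costs O(d^2) operations instead of the O(d^3) of a fresh inversion.
    """
    a_inv_g = a_inv @ g
    denom = 1.0 + g @ a_inv_g
    return a_inv - np.outer(a_inv_g, a_inv_g) / denom
```

With this helper, line 7 of Algorithm 1 can maintain $A_t^{-1}$ directly, so that line 8 needs only a matrix-vector product before the projection.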



For Algorithm 1 we can prove the following:

Theorem 3.1. Let $k, q \ge 1$, and set $A_0 = \epsilon I_{m+k}$, $\epsilon = \frac{1}{\eta^2 D^2}$, $\eta = \frac{1}{2} \min\{4GD, \lambda\}$. Then, for any
data sequence $\{X_t\}_{t=1}^{T}$ that satisfies the assumptions from Section 2.3, Algorithm 1 generates an online
sequence $\{\gamma^t\}_{t=1}^{T}$, for which the following holds:

\[ \sum_{t=1}^{T} \ell_t^m(\gamma^t) - \min_{\alpha, \beta} \sum_{t=1}^{T} \mathbb{E}[f_t(\alpha, \beta)] = O\Big(\Big(GD + \frac{1}{\lambda}\Big) \log(T)\Big). \tag{7} \]

Remark: The expectation is necessary since the noise terms are unknown. Also, obtaining a high
probability bound on the regret is possible but requires additional assumptions on the noise process such as
boundedness or light tail.

² It is easy to show that every exp-concave function is convex; the converse does not hold.

Proof. Intuitively, Theorem 3.1 states that we can have a regret as low as the best ARMA($k, q$) model,
using only an AR($m + k$) model. The proof consists of two steps. In the first step we bound the regret
suffered by an AR($m + k$) prediction using familiar techniques of online convex optimization. In the second
step we bound the distance between the AR($m + k$) loss function and the ARMA($k, q$) loss function,
using a chain of bounds and inequalities. Integrating both steps yields the requested regret bound for the
ARMA($k, q$) loss function.

Step 1: Relying on the fact that the loss functions $\{\ell_t^m\}_{t=1}^{T}$ are $\lambda$-exp-concave, we can guarantee that

\[ \sum_{t=1}^{T} \ell_t^m(\gamma^t) - \min_{\gamma} \sum_{t=1}^{T} \ell_t^m(\gamma) = O\Big(\Big(GD + \frac{1}{\lambda}\Big) \log(T)\Big), \]

using the Online Newton Step (ONS) algorithm, presented in [HAK07].

Step 2: Define recursively

\[ X_t^{\infty}(\alpha, \beta) = \sum_{i=1}^{k} \alpha_i X_{t-i} + \sum_{i=1}^{q} \beta_i \big(X_{t-i} - X_{t-i}^{\infty}(\alpha, \beta)\big), \]

with initial condition $X_1^{\infty}(\alpha, \beta) = X_1$. We then denote by

\[ f_t^{\infty}(\alpha, \beta) = \ell_t\big(X_t, X_t^{\infty}(\alpha, \beta)\big) \tag{8} \]

the loss suffered by the prediction $X_t^{\infty}(\alpha, \beta)$ at iteration $t$. From this definition it follows that $X_t^{\infty}(\alpha, \beta)$
is of the form $X_t^{\infty}(\alpha, \beta) = \sum_{i=1}^{t-1} c_i(\alpha, \beta) X_{t-i}$ for some appropriate coefficients $c_i(\alpha, \beta)$. The motivation
behind the definition of $f_t^{\infty}$ follows from the need to replace $f_t$ with a loss function that fits the full
information online optimization model (no unknown parameters). We set $m \in \mathbb{N}$, and define

\[ X_t^{m}(\alpha, \beta) = \sum_{i=1}^{k} \alpha_i X_{t-i} + \sum_{i=1}^{q} \beta_i \big(X_{t-i} - X_{t-i}^{m-i}(\alpha, \beta)\big), \]

with initial condition $X_t^{m}(\alpha, \beta) = X_t$ for all $t$ and $m \le 0$. We denote by

\[ f_t^{m}(\alpha, \beta) = \ell_t\big(X_t, X_t^{m}(\alpha, \beta)\big) \tag{9} \]

the loss suffered by the prediction $X_t^{m}(\alpha, \beta)$ at iteration $t$. The motivation here is simple: it is easier
to generate predictions using only the last $(m + k)$ signals, and the distance between the loss functions is
relatively small. Now, let

\[ (\alpha^*, \beta^*) = \arg\min_{\alpha, \beta} \sum_{t=1}^{T} \mathbb{E}[f_t(\alpha, \beta)] \tag{10} \]

denote the best ARMA coefficients in hindsight for predicting the signal $\{X_t\}_{t=1}^{T}$.
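Then, from Lemma 3.2, stated and proven below, we have that

\[ \min_{\gamma} \sum_{t=1}^{T} \ell_t^m(\gamma) \le \sum_{t=1}^{T} f_t^{m}(\alpha^*, \beta^*), \]

and it follows that

\[ \sum_{t=1}^{T} \ell_t^m(\gamma^t) - \sum_{t=1}^{T} f_t^{m}(\alpha^*, \beta^*) = O\Big(\Big(GD + \frac{1}{\lambda}\Big) \log(T)\Big). \]

From Lemma 3.3 below we know that

\[ \Big| \sum_{t=1}^{T} \mathbb{E}[f_t^{\infty}(\alpha^*, \beta^*)] - \sum_{t=1}^{T} \mathbb{E}[f_t^{m}(\alpha^*, \beta^*)] \Big| = O(1) \]

for $m = q \cdot \log_{1-\varepsilon}\big((T L M_{\max})^{-1}\big)$, which implies that

\[ \sum_{t=1}^{T} \ell_t^m(\gamma^t) - \sum_{t=1}^{T} \mathbb{E}[f_t^{\infty}(\alpha^*, \beta^*)] = O\Big(\Big(GD + \frac{1}{\lambda}\Big) \log(T)\Big). \]

Finally, from Lemma 3.4 below we know that

\[ \Big| \sum_{t=1}^{T} \mathbb{E}[f_t^{\infty}(\alpha^*, \beta^*)] - \sum_{t=1}^{T} \mathbb{E}[f_t(\alpha^*, \beta^*)] \Big| = O(1), \]

and thus

\[ \sum_{t=1}^{T} \ell_t^m(\gamma^t) - \min_{\alpha, \beta} \sum_{t=1}^{T} \mathbb{E}[f_t(\alpha, \beta)] = O\Big(\Big(GD + \frac{1}{\lambda}\Big) \log(T)\Big). \]

Next, we prove the lemmas we used.

□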



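Lemma 3.2. Let $\ell_t^m(\gamma)$, $f_t^{m}(\alpha, \beta)$ and $(\alpha^*, \beta^*)$ be as defined in Section 3 and in Equations (9) and (10). Then, for all
$m \in \mathbb{N}$ and data sequence $\{X_t\}_{t=1}^{T}$ that satisfies the assumptions from Section 2.3, it holds that

\[ \min_{\gamma} \sum_{t=1}^{T} \ell_t^m(\gamma) \le \sum_{t=1}^{T} f_t^{m}(\alpha^*, \beta^*). \]

Proof. Note that if we set $\gamma_i^* = c_i(\alpha^*, \beta^*)$, we immediately get that

\[ \sum_{t=1}^{T} \ell_t^m(\gamma^*) = \sum_{t=1}^{T} f_t^{m}(\alpha^*, \beta^*). \]

Trivially, it always holds that

\[ \min_{\gamma} \sum_{t=1}^{T} \ell_t^m(\gamma) \le \sum_{t=1}^{T} \ell_t^m(\gamma^*), \]

which completes the proof.

□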



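Lemma 3.3. Let $f_t^{\infty}(\alpha, \beta)$, $f_t^{m}(\alpha, \beta)$ and $(\alpha^*, \beta^*)$ be as denoted in Equations (8), (9) and (10). Then, for any
data sequence $\{X_t\}_{t=1}^{T}$ that satisfies the assumptions from Section 2.3, it holds that

\[ \Big| \sum_{t=1}^{T} \mathbb{E}[f_t^{\infty}(\alpha^*, \beta^*)] - \sum_{t=1}^{T} \mathbb{E}[f_t^{m}(\alpha^*, \beta^*)] \Big| = O(1), \]

if we choose $m = q \cdot \log_{1-\varepsilon}\big((T L M_{\max})^{-1}\big)$.

Proof. We fix $t$, and look at the distance between $f_t^{\infty}(\alpha^*, \beta^*)$ and $f_t^{m}(\alpha^*, \beta^*)$ in expectation. We show by
induction that

\[ \mathbb{E}\big[|X_t^{m}(\alpha^*, \beta^*) - X_t^{\infty}(\alpha^*, \beta^*)|\big] \le 2M_{\max} \cdot (1 - \varepsilon)^{m/q}. \]

For $m = 0$ we have that $X_t^{m}(\alpha^*, \beta^*) = X_t$ from the definition, and hence

\[ |X_t^{m}(\alpha^*, \beta^*) - X_t^{\infty}(\alpha^*, \beta^*)| = |X_t - X_t^{\infty}(\alpha^*, \beta^*)| \le |X_t - X_t^{\infty}(\alpha^*, \beta^*) - \epsilon_t| + |\epsilon_t|. \]

Now, $\mathbb{E}[|\epsilon_t|] < M_{\max} < \infty$ for all $t$ and $\mathbb{E}[|X_t - X_t^{\infty}(\alpha^*, \beta^*) - \epsilon_t|]$ decays exponentially as proven in
Lemma 3.4, and hence the inductive basis holds for $m = 0$. Next, we prove that the inductive basis holds
for $m = 1, \ldots, q - 1$:

\begin{align*}
|X_t^{m}(\alpha^*, \beta^*) - X_t^{\infty}(\alpha^*, \beta^*)|
 &= \Big| \sum_{i=1}^{q} \beta_i^* \big(X_{t-i} - X_{t-i}^{m-i}(\alpha^*, \beta^*)\big) - \sum_{i=1}^{q} \beta_i^* \big(X_{t-i} - X_{t-i}^{\infty}(\alpha^*, \beta^*)\big) \Big| \\
 &= \Big| \sum_{i=1}^{q} \beta_i^* \big(X_{t-i}^{\infty}(\alpha^*, \beta^*) - X_{t-i}^{m-i}(\alpha^*, \beta^*)\big) \Big| \\
 &\overset{(1)}{\le} \sum_{i=1}^{m} |\beta_i^*| \cdot |X_{t-i}^{m-i}(\alpha^*, \beta^*) - X_{t-i}^{\infty}(\alpha^*, \beta^*)| + \sum_{i=m+1}^{q} |\beta_i^*| \cdot |X_{t-i}^{\infty}(\alpha^*, \beta^*) - X_{t-i}| \\
 &\overset{(2)}{\le} \sum_{i=1}^{m} |\beta_i^*| \cdot 2M_{\max} \cdot (1 - \varepsilon)^{\frac{m-i}{q}} + \sum_{i=m+1}^{q} |\beta_i^*| \cdot 2M_{\max} \\
 &\overset{(3)}{\le} \sum_{i=1}^{q} |\beta_i^*| \cdot 2M_{\max} \cdot (1 - \varepsilon)^{\frac{m-q}{q}} \\
 &\le 2M_{\max} \cdot (1 - \varepsilon)^{\frac{m-q}{q} + 1} = 2M_{\max} \cdot (1 - \varepsilon)^{m/q}.
\end{align*}

(1) is true from the triangle inequality and from the definition of $X_t^{m}$ for $m \le 0$. (2) is true from the
inductive hypothesis on $m$. (3) is true since $1 \le (1 - \varepsilon)^{\frac{m-q}{q}}$ for $m = 1, \ldots, q - 1$.
For the inductive step we assume that

\[ |X_{\tau}^{\mu}(\alpha^*, \beta^*) - X_{\tau}^{\infty}(\alpha^*, \beta^*)| \le 2M_{\max} \cdot (1 - \varepsilon)^{\mu/q} \]

for $q \le \mu < m$ and $\tau \le t$, and prove that

\[ |X_t^{m}(\alpha^*, \beta^*) - X_t^{\infty}(\alpha^*, \beta^*)| \le 2M_{\max} \cdot (1 - \varepsilon)^{m/q}. \]

Thus,

\begin{align*}
|X_t^{m}(\alpha^*, \beta^*) - X_t^{\infty}(\alpha^*, \beta^*)|
 &= \Big| \sum_{i=1}^{q} \beta_i^* \big(X_{t-i} - X_{t-i}^{m-i}(\alpha^*, \beta^*)\big) - \sum_{i=1}^{q} \beta_i^* \big(X_{t-i} - X_{t-i}^{\infty}(\alpha^*, \beta^*)\big) \Big| \\
 &\le \sum_{i=1}^{q} |\beta_i^*| \cdot |X_{t-i}^{m-i}(\alpha^*, \beta^*) - X_{t-i}^{\infty}(\alpha^*, \beta^*)| \\
 &\le \sum_{i=1}^{q} |\beta_i^*| \cdot 2M_{\max} \cdot (1 - \varepsilon)^{\frac{m-i}{q}} \\
 &\le \sum_{i=1}^{q} |\beta_i^*| \cdot 2M_{\max} \cdot (1 - \varepsilon)^{\frac{m-q}{q}} \\
 &\le (1 - \varepsilon) \cdot 2M_{\max} \cdot (1 - \varepsilon)^{\frac{m-q}{q}} = 2M_{\max} \cdot (1 - \varepsilon)^{m/q},
\end{align*}

which completes the induction. Recall that $\ell_t$ is Lipschitz continuous for some Lipschitz constant $L > 0$ from
Assumption 2, and hence it follows that

\begin{align*}
\big| \mathbb{E}[f_t^{\infty}(\alpha^*, \beta^*)] - \mathbb{E}[f_t^{m}(\alpha^*, \beta^*)] \big|
 &= \big| \mathbb{E}[\ell_t(X_t, X_t^{\infty}(\alpha^*, \beta^*))] - \mathbb{E}[\ell_t(X_t, X_t^{m}(\alpha^*, \beta^*))] \big| \\
 &\le \mathbb{E}\big[|\ell_t(X_t, X_t^{\infty}(\alpha^*, \beta^*)) - \ell_t(X_t, X_t^{m}(\alpha^*, \beta^*))|\big] \\
 &\le L \cdot \mathbb{E}\big[|X_t^{\infty}(\alpha^*, \beta^*) - X_t^{m}(\alpha^*, \beta^*)|\big] \\
 &\le L \cdot 2M_{\max} \cdot (1 - \varepsilon)^{m/q},
\end{align*}

where the first inequality follows from Jensen's inequality. By summing the above for all $t$ we get that

\[ \Big| \sum_{t=1}^{T} \mathbb{E}[f_t^{\infty}(\alpha^*, \beta^*)] - \sum_{t=1}^{T} \mathbb{E}[f_t^{m}(\alpha^*, \beta^*)] \Big| \le TL \cdot 2M_{\max} \cdot (1 - \varepsilon)^{m/q}. \]

Finally, choosing $m = q \cdot \log_{1-\varepsilon}\big((T L M_{\max})^{-1}\big)$ yields

\[ \Big| \sum_{t=1}^{T} \mathbb{E}[f_t^{\infty}(\alpha^*, \beta^*)] - \sum_{t=1}^{T} \mathbb{E}[f_t^{m}(\alpha^*, \beta^*)] \Big| = O(1). \]

□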
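The choice of $m$ in Lemma 3.3 can be checked numerically. The sketch below is only an illustration: the function names and the example values of $T$, $L$, $M_{\max}$, $\varepsilon$ and $q$ are assumptions chosen for the example, and in practice $m$ would be rounded up to an integer.

```python
import math

def truncation_horizon(q, T, L, M_max, eps):
    """m = q * log_{1-eps}((T * L * M_max)^{-1}), via a change of logarithm base."""
    return q * math.log(1.0 / (T * L * M_max)) / math.log(1.0 - eps)

def truncation_bound(m, q, T, L, M_max, eps):
    """The bound T * L * 2 * M_max * (1 - eps)^(m / q) on the summed difference."""
    return T * L * 2.0 * M_max * (1.0 - eps) ** (m / q)

# Illustrative values (assumptions, not taken from the experiments).
m = truncation_horizon(q=2, T=10_000, L=1.0, M_max=1.0, eps=0.1)
bound = truncation_bound(m, q=2, T=10_000, L=1.0, M_max=1.0, eps=0.1)
```

With this $m$, the factor $(1 - \varepsilon)^{m/q}$ equals $(T L M_{\max})^{-1}$, so the bound collapses to the constant 2 regardless of $T$, while $m$ itself grows only logarithmically in $T$.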



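Lemma 3.4. Let $f_t(\alpha, \beta)$, $f_t^{\infty}(\alpha, \beta)$ and $(\alpha^*, \beta^*)$ be as denoted in Equations (3), (8) and (10). Then, for any
data sequence $\{X_t\}_{t=1}^{T}$ that satisfies the assumptions from Subsection 2.3, it holds that

\[ \Big| \sum_{t=1}^{T} \mathbb{E}[f_t^{\infty}(\alpha^*, \beta^*)] - \sum_{t=1}^{T} \mathbb{E}[f_t(\alpha^*, \beta^*)] \Big| = O(1). \]

Proof. First, denote by $(\alpha', \beta')$ the coefficient vectors that have generated the signal. Trivially, it holds that

\[ \sum_{t=1}^{T} f_t(\alpha', \beta') = \sum_{t=1}^{T} \ell_t(X_t, X_t - \epsilon_t). \]

From Assumption 1, $\epsilon_t$ is independent of $\epsilon_1, \ldots, \epsilon_{t-1}$, and hence the best prediction available at time $t$ will
cause a loss of at least $\ell_t(X_t, X_t - \epsilon_t)$ in expectation. We can think of it in the following way: at time $t$,
the online player has no previous information regarding the adversary's choice of $\epsilon_t$. Since $\mathbb{E}[\epsilon_t] = 0$ and
$\ell_t$ is convex, predicting the expected signal is the optimal policy of the online player at time $t$. It follows
that $(\alpha^*, \beta^*) = (\alpha', \beta')$, meaning the best ARMA coefficients in hindsight are those that have generated
the signal.

Next, we show by induction that $\mathbb{E}[|X_t - X_t^{\infty}(\alpha^*, \beta^*) - \epsilon_t|]$ decays exponentially as $t$ grows linearly.
Without loss of generality, we can assume that for $t = 1, \ldots, q$ we have that $\mathbb{E}[|X_t - X_t^{\infty}(\alpha^*, \beta^*) - \epsilon_t|] < \rho$
for some $\rho > 0$, as the inductive basis. Now, for the inductive step we assume that

\[ \mathbb{E}\big[|X_{\tau} - X_{\tau}^{\infty}(\alpha^*, \beta^*) - \epsilon_{\tau}|\big] \le \rho \cdot (1 - \varepsilon)^{\tau/q} \]

for $q \le \tau < t$, and prove that

\[ \mathbb{E}\big[|X_t - X_t^{\infty}(\alpha^*, \beta^*) - \epsilon_t|\big] \le \rho \cdot (1 - \varepsilon)^{t/q}. \]

Thus,

\begin{align*}
\mathbb{E}\big[|X_t - X_t^{\infty}(\alpha^*, \beta^*) - \epsilon_t|\big]
 &= \mathbb{E}\Big[\Big| \sum_{i=1}^{k} \alpha_i^* X_{t-i} + \sum_{i=1}^{q} \beta_i^* \epsilon_{t-i} - \sum_{i=1}^{k} \alpha_i^* X_{t-i} - \sum_{i=1}^{q} \beta_i^* \big(X_{t-i} - X_{t-i}^{\infty}(\alpha^*, \beta^*)\big) \Big|\Big] \\
 &= \mathbb{E}\Big[\Big| \sum_{i=1}^{q} \beta_i^* \big(X_{t-i}^{\infty}(\alpha^*, \beta^*) - X_{t-i} + \epsilon_{t-i}\big) \Big|\Big] \\
 &\le \sum_{i=1}^{q} |\beta_i^*| \cdot \mathbb{E}\big[|X_{t-i} - X_{t-i}^{\infty}(\alpha^*, \beta^*) - \epsilon_{t-i}|\big] \\
 &\le \sum_{i=1}^{q} |\beta_i^*| \cdot \rho \cdot (1 - \varepsilon)^{\frac{t-i}{q}}
 \le \sum_{i=1}^{q} |\beta_i^*| \cdot \rho \cdot (1 - \varepsilon)^{\frac{t-q}{q}} \\
 &\le (1 - \varepsilon) \cdot \rho \cdot (1 - \varepsilon)^{\frac{t-q}{q}} = \rho \cdot (1 - \varepsilon)^{t/q},
\end{align*}

which ends the induction. Recall that $\ell_t$ is assumed to be Lipschitz continuous for some constant $L > 0$, and
hence it follows that

\begin{align*}
\big| \mathbb{E}[f_t^{\infty}(\alpha^*, \beta^*)] - \mathbb{E}[f_t(\alpha^*, \beta^*)] \big|
 &= \big| \mathbb{E}[\ell_t(X_t, X_t^{\infty}(\alpha^*, \beta^*))] - \mathbb{E}[\ell_t(X_t, X_t - \epsilon_t)] \big| \\
 &= \big| \mathbb{E}[\ell_t(X_t, X_t^{\infty}(\alpha^*, \beta^*)) - \ell_t(X_t, X_t - \epsilon_t)] \big| \\
 &\le \mathbb{E}\big[|\ell_t(X_t, X_t^{\infty}(\alpha^*, \beta^*)) - \ell_t(X_t, X_t - \epsilon_t)|\big] \\
 &\le L \cdot \mathbb{E}\big[|X_t - X_t^{\infty}(\alpha^*, \beta^*) - \epsilon_t|\big] \\
 &\le \rho L \cdot (1 - \varepsilon)^{t/q}.
\end{align*}

Finally, summing over all iterations yields

\[ \Big| \sum_{t=1}^{T} \mathbb{E}[f_t^{\infty}(\alpha^*, \beta^*)] - \sum_{t=1}^{T} \mathbb{E}[f_t(\alpha^*, \beta^*)] \Big| = O(1). \]

□

Remark: In Lemma 3.4 we assume that $\rho q L = O(1)$. Otherwise, an element of $O(\rho q L)$ is added to
the regret bound in Theorem 3.1, which does not affect the asymptotic result.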
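The origin of the $\rho q L$ term can be made explicit by summing the per-iteration bound $\rho L (1 - \varepsilon)^{t/q}$ from the proof of Lemma 3.4. The following worked step is a sketch that treats $\varepsilon$ as a fixed constant and uses Bernoulli's inequality $(1 - \varepsilon)^{1/q} \le 1 - \varepsilon/q$, valid for $q \ge 1$.

```latex
\begin{align*}
\sum_{t=1}^{T} \rho L \,(1 - \varepsilon)^{t/q}
  \;\le\; \rho L \sum_{t=1}^{\infty} \Big((1 - \varepsilon)^{1/q}\Big)^{t}
  \;=\; \rho L \,\frac{(1 - \varepsilon)^{1/q}}{1 - (1 - \varepsilon)^{1/q}}
  \;\le\; \frac{\rho L}{1 - (1 - \varepsilon)^{1/q}}
  \;\le\; \frac{\rho q L}{\varepsilon}.
\end{align*}
```

The sum is therefore bounded independently of $T$, and it is $O(1)$ whenever $\rho q L = O(1)$.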

3.3 ARMA Online Gradient Descent (ARMA-OGD) 

We now turn to present a different algorithm for choosing $\gamma^t$ at each time point. This algorithm is applicable
to general convex loss functions, as well as to exp-concave ones. It is computationally simpler but has a
somewhat worse theoretical (and empirical) performance compared to the previous one, when considering
an exp-concave loss function. The notation $\Pi_{\mathcal{K}}$ refers to the Euclidean projection onto $\mathcal{K}$, i.e.,
$\Pi_{\mathcal{K}}(y) = \arg\min_{x \in \mathcal{K}} \|y - x\|_2$.

Algorithm 2 ARMA-OGD(k,q)
1: Input: ARMA order $k, q$; learning rate $\eta$.
2: Set $m = q \cdot \log_{1-\varepsilon}\big((T L M_{\max})^{-1}\big)$.
3: Choose $\gamma^1 \in \mathcal{K}$ arbitrarily.
4: for $t = 1$ to $(T - 1)$ do
5:     Predict $\tilde{X}_t(\gamma^t) = \sum_{i=1}^{m+k} \gamma_i^t X_{t-i}$.
6:     Observe $X_t$ and suffer loss $\ell_t^m(\gamma^t)$.
7:     Let $\nabla_t = \nabla \ell_t^m(\gamma^t)$.
8:     Set $\gamma^{t+1} \leftarrow \Pi_{\mathcal{K}}\big(\gamma^t - \frac{1}{\eta} \nabla_t\big)$.
9: end for

For Algorithm 2 we can prove the following:

Theorem 3.5. Let $k, q \ge 1$, and set $\eta =$ . Then, for any data sequence $\{X_t\}_{t=1}^{T}$ that satisfies the
assumptions from Section 2.3, Algorithm 2 generates an online sequence $\{\gamma^t\}_{t=1}^{T}$, for which the following
holds:

\[ \sum_{t=1}^{T} \ell_t^m(\gamma^t) - \min_{\alpha, \beta} \sum_{t=1}^{T} \mathbb{E}[f_t(\alpha, \beta)] = O\big(GD\sqrt{T}\big). \tag{11} \]

The proof of this theorem is very similar to the proof of Theorem 3.1, albeit plugging into our framework
the Online Gradient Descent (OGD) algorithm of [Zin03] rather than the Online Newton Step algorithm.
4 Additional results 

In this section we present an analysis for the case when the noise terms are allowed to be adversarial, and
also an application of Theorem 3.1 for squared loss.

4.1 Adversarial noise



The results presented in Theorems 3.1 and 3.5 rely on the assumptions that the noise terms are independent
and zero-mean. Under these assumptions, the best coefficient vectors in hindsight are those that have
generated the signal. However, if we allow the noise terms to be adversarially generated (the adversary
chooses $\epsilon_t$ at time $t$ with no limitations), the best coefficient vectors in hindsight are not necessarily the
ones used for generating the signal. For this case we have the following theorem:

Theorem 4.1. Denote by $(\alpha', \beta')$ the coefficient vectors that have generated the signal, and assume that
$\{X_t\}_{t=1}^{T}$ satisfies Assumptions 2-5 from Section 2.3, when the noise terms are allowed to be chosen
adversarially. Then, for exp-concave loss functions Algorithm 1 generates an online sequence $\{\gamma^t\}_{t=1}^{T}$, for which
the following holds:

\[ \sum_{t=1}^{T} \ell_t^m(\gamma^t) - \sum_{t=1}^{T} \mathbb{E}[f_t(\alpha', \beta')] = O\Big(\Big(GD + \frac{1}{\lambda}\Big) \log(T)\Big), \]

and for convex loss functions, Algorithm 2 generates an online sequence $\{\gamma^t\}_{t=1}^{T}$, for which the following
holds:

\[ \sum_{t=1}^{T} \ell_t^m(\gamma^t) - \sum_{t=1}^{T} \mathbb{E}[f_t(\alpha', \beta')] = O\big(GD\sqrt{T}\big). \]

Notice that we compare here the total loss suffered by our algorithms to the expected loss suffered by
ARMA prediction with the coefficient vectors that have generated the signal, and not to the expected loss
of the best ARMA prediction in hindsight. Nevertheless, this theorem captures interesting cases (e.g.,
correlated noise), in which traditional approaches fail to perform properly. The proof of this theorem
resembles the proof of Theorem 3.1, with the modification of plugging $(\alpha', \beta')$ into Lemmas 3.3 and 3.4
instead of $(\alpha^*, \beta^*)$.

4.2 Application of Theorem 3.1 to squared loss

As already mentioned, the squared loss is the most commonly used loss function in time series analysis. It
is defined as $\ell_t(X_t, \tilde{X}_t) = (X_t - \tilde{X}_t)^2$ for prediction $\tilde{X}_t$ and signal $X_t$. In our case, the predictions come
from an AR model with horizon $(m + k)$, and hence our loss at time $t$ is $\big(X_t - \sum_{i=1}^{m+k} \gamma_i^t X_{t-i}\big)^2$, when
$\gamma^t$ are generated using Algorithm 1. Substituting the values of $G$, $D$ and $\lambda$, as defined and computed
in Subsection 3.1 for the squared loss, yields the following result:

\[ \sum_{t=1}^{T} \ell_t^m(\gamma^t) - \min_{\alpha, \beta} \sum_{t=1}^{T} \mathbb{E}[f_t(\alpha, \beta)] = O\big(k \log(T) + q \log^2(T)\big). \tag{12} \]

This result implies that the average loss suffered by Algorithm 1 converges asymptotically to the average
loss suffered by the best ARMA prediction in hindsight, under the assumptions from Section 2.3. In Section 5
we empirically verify this theoretical result, under some different settings.
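The step from the regret bound to convergence of the average loss is a division by $T$. The following short calculation spells it out; the numerical illustration ignores the constants hidden in the $O(\cdot)$ notation and uses $k = 5$, $q = 2$, $T = 10^4$ with natural logarithms purely as example values.

```latex
\[
\frac{1}{T}\Big(\sum_{t=1}^{T} \ell_t^m(\gamma^t) - \min_{\alpha,\beta} \sum_{t=1}^{T} \mathbb{E}[f_t(\alpha,\beta)]\Big)
  = O\Big(\frac{k \log T + q \log^2 T}{T}\Big) \xrightarrow[T \to \infty]{} 0 .
\]
% Illustration only: for k = 5, q = 2, T = 10^4,
% k log T + q log^2 T \approx 46.1 + 169.7 = 215.7, so the per-round term is about 0.022.
```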
5 Experiments



The following experiments demonstrate the prediction effectiveness of the proposed algorithms, under some
different settings. We compare the performance to the ARMA-RLS algorithm, which was presented in
[DSC06]. In a few words, the ARMA-RLS is a "proper learning" algorithm: it tries to mimic the
underlying model. It estimates the noise terms using a recursive least squares based method, and makes a
prediction using these estimations and the previous signals. The ARMA-RLS does not assume noise
stationarity or ergodicity. We also benchmark the standard Yule-Walker estimation method.³ The results are
displayed in the figures below. In all cases, the x-axis is time (number of samples), and the y-axis is the
average squared loss.



5.1 Experiments with artificial data 

In all experimental settings below we have averaged the results over 20 runs for stability. Also, we choose
the order of our AR prediction to be $m + k = 10$ in all settings.



Setting 1. We started with a simple sanity check using Gaussian noise. We generated a stationary ARMA
process using the coefficient vectors $\alpha = [0.6, -0.5, 0.4, -0.4, 0.3]$ and $\beta = [0.3, -0.2]$, when the noise
terms are uncorrelated and normally distributed as $\mathcal{N}(0, 0.3^2)$. Note that since predicting the noise is
impossible, a perfect predictor will suffer an average error rate of at least the variance of the noise, which is 0.09
in this setting. As can be seen in Figure 1(a), the ARMA-ONS algorithm outperforms the other online
algorithms due to its lower regret in this setting of exp-concave loss functions, and quickly approaches the
performance of the perfect predictor.



Setting 2. We generated the non-stationary ARMA process using the coefficient vectors $\beta = [0.32, -0.2]$
and

\[ \alpha(t) = [-0.4, -0.5, 0.4, 0.4, 0.1] \cdot \Big(\frac{t}{10^4}\Big) + [0.6, -0.4, 0.4, -0.5, 0.4] \cdot \Big(1 - \frac{t}{10^4}\Big), \]

i.e., the coefficient vectors change slowly in time. The noise terms are uncorrelated and distributed
uniformly on $[-0.5, 0.5]$ (denoted as Uni$[-0.5, 0.5]$). In this setting, a perfect predictor will suffer an average
error rate of at least 0.0833, due to the variance of the noise. The motivation behind this setting is to
demonstrate the effectiveness of the online algorithms in the non-stationary case, in which the coefficients
change in time. This is especially important when dealing with real data time series, since the stationarity
assumption is rather strict. In Figure 1(b) we can see the clear advantage of our online algorithms. Here
again, ARMA-ONS is superior to the other algorithms, despite it being less adaptive, as the theoretical
bounds predict; see [HS09] for discussion of adaptivity of OGD vs. ONS.
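The figure 0.0833 quoted above is the variance of the uniform distribution on $[-0.5, 0.5]$. As a quick check, using the standard formula for the variance of a uniform distribution on $[a, b]$:

```latex
\[
\operatorname{Var}\big(\mathrm{Uni}[a, b]\big) = \frac{(b - a)^2}{12}
\quad\Longrightarrow\quad
\operatorname{Var}\big(\mathrm{Uni}[-0.5, 0.5]\big) = \int_{-1/2}^{1/2} x^2 \, dx = \frac{1}{12} \approx 0.0833 .
\]
```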



Setting 3. Here we consider the non-stationary ARMA process that is generated using two different sets
of coefficient vectors. The first set is $\alpha = [0.6, -0.5, 0.4, -0.4, 0.3]$ and $\beta = [0.3, -0.2]$, and it is used for
generating the signal at the first half of the iterations. The second set is $\alpha = [-0.4, -0.5, 0.4, 0.4, 0.1]$ and
$\beta = [-0.3, 0.2]$, and it is used for generating the signal at the second half of the iterations. The noise terms
are uncorrelated and distributed Uni$[-0.5, 0.5]$. In Figure 1(c) we demonstrate the effectiveness of online
algorithms in a scenario when the coefficients abruptly change. Here again, a perfect predictor will suffer an
average error rate of at least 0.0833, due to the variance of the noise.

Figure 1: Experimental results for artificial data, all averaged over 20 runs. Panels: (a) Setting 1, sanity check; (b) Setting 2, slowly changing coefficients; (c) Setting 3, abrupt change; (d) Setting 4, correlated noise. The x-axis of each panel is time.

³ The Yule-Walker estimation method is offline. We use it as an online prediction method by a simple adaptation: we let it predict
the signal at time $t$ with the knowledge of the signal at times $1, \ldots, t - 1$.



Setting 4. Consider an ARMA process that is generated using the coefficient vectors $\alpha = [0.11, -0.5]$ and
$\beta = [0.41, -0.39, -0.685, 0.1]$. Each noise term is distributed normally, with expectation that is the value
of the previous noise term, and variance $0.3^2$. That is, the noise terms are positively correlated. In Figure 1(d)
one can clearly see the robustness of online algorithms to correlated noise. Note that despite the correlation
introduced in this setting, ARMA-ONS achieves an average error rate that converges approximately to the
variance of the noise, 0.09.
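The noise process of Setting 4 can be sketched as a Gaussian random walk, since each term is drawn around the previous one. The function name `correlated_noise`, the starting value of zero before the first term, and the seed are assumptions made for this illustration.

```python
import numpy as np

def correlated_noise(n, sigma=0.3, seed=0):
    """Draw eps_t ~ Normal(mean=eps_{t-1}, std=sigma) for t = 0, ..., n-1.

    Assumption of this sketch: the term preceding the first one is zero.
    Each eps_t equals eps_{t-1} plus an independent Normal(0, sigma^2)
    increment, so the sequence is the cumulative sum of the increments.
    """
    rng = np.random.default_rng(seed)
    return np.cumsum(rng.normal(0.0, sigma, size=n))
```

The conditional variance of each term given the previous one is $\sigma^2 = 0.09$, which matches the noise variance quoted in the setting.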



5.2 Experiments with real data 

In this section we provide some preliminary results on real data time series, and show that for such data as
well, our online learning approach is reasonably effective compared to existing approaches. For robustness,
we consider time series from different fields.

Figure 2: Experimental results for real data.

The first time series is taken from the field of weather research. Each data point in this time series is
the monthly average temperature of the sea surface, measured at a specific point. The data is taken from
the Global Climate Observing System (GCOS) website. Since we are dealing with a weather related time
series, and considering the monthly average temperature, it is rather reasonable that the time series follows
a certain pattern. As can be seen in Figure 2(a), this pattern can be well learned using the ARMA model by
all four algorithms. However, the results clearly indicate the superiority of online algorithms.

The second time series is taken from the field of finance. Each data point in this time series is the daily
return of the S&P 500 index. The data is taken from Yahoo! Finance. The results in Figure 2(b) indicate
that the ARMA model is probably not a good model for predicting the returns of the S&P 500 index. A
possible reason is that the ARMA model is not rich enough, i.e., knowing the history of returns is not
sufficient for making a good prediction. The fact that familiar offline methods also fail here strengthens
this claim. See Section 6 for further discussion about fitting a time series model for financial data.

6 Conclusion and discussion



In this paper we developed a new approach for time series analysis: an online learning approach. Our
main result in this paper is that one can predict time series as well as the best ARMA model, regardless of
the loss function considered, under weak assumptions on the noise terms (a zero-mean distribution). This
result is strengthened in light of the fact that the noise terms in the underlying model are unknown to us at
any stage. We overcome this difficulty by using improper learning techniques. Additionally, we present an
analytical extension of our approach to adversarially generated noise terms. The main powerful properties
of the online approach, as pointed out in our work, are generality, simplicity and efficiency, in comparison
to existing methods.

There are three issues that remain for further research. First, in our analysis we assume that $\sum_{i=1}^{q} |\beta_i| <
1 - \varepsilon$ for some $\varepsilon > 0$, which seems to limit the freedom of the $\beta$ coefficients. This assumption appears
sometimes in the literature (e.g. in GARCH models) and is a sufficient condition for the MA component
to be causally invertible, yet not necessary. In our case, we believe that this assumption follows from our
proof techniques and the results would still hold for any $\beta$ coefficients. Second, in Section 4 we present
results in which the total loss suffered by our algorithms is compared to the expected loss suffered by
ARMA prediction with the coefficient vectors that have generated the signal. Whereas competing against
the best ARMA prediction under adversarial noise is impossible because of identifiability issues, it would
be interesting to study intermediate setups such as correlated or adversarial noise to some extent. Third,
the ARMA model is not suitable for every time series, as can be seen in Section 5.2 when a finance
related time series is considered. However, [Eng82] showed that some finance related time series can be
well predicted using the ARCH model and its expansions. Therefore, it would be interesting to generalize
our work to other time series models, such as ARCH and ARIMA.